ABSTRACT 

The discussion of criterion-referenced tests (CRTs) and
norm-referenced achievement tests (NRTs) is divided into two parts:
definitions and use. The authors contrast CRTs, tests which compare
an individual's performance to some specified behavioral criterion of
proficiency, and NRTs, tests which compare an individual's score to
scores of others. They also state that any test scores must be
related to test content; that NRTs also possess content validity;
that allegations that standardized test publishers ignore item
content in favor of item statistics are inaccurate; and that all
achievement tests should be keyed to objectives or to a specified
content domain. In deciding which test to use, the authors state that
standardized achievement tests sample a broader domain than CRTs and
are more likely to help determine the adequacy of a curriculum than
CRTs which are tailor-made to specific instructional objectives. CRT
interpretation is described as useful for assessing mastery learning;
for decision making about instructional change; and for use in broad
surveys of educational accomplishment. NRT interpretation is said to
be useful for rank ordering of students in specific areas of
achievement and for decision making in assessing qualitative factors
in addition to mastery. (HH)

About This Report





William A. Mehrens 



Robert L. Ebel



Practitioners are today provided with a variety of
tests that can be used to aid them in making
instructional decisions and in curriculum planning.
Some confusion has arisen concerning the distinctions
among these tests, and how they can be used.
The authors do much to dispel this confusion.



Drs. Mehrens and Ebel launch the first of a new
series of papers which will explore the construction,
interpretation and use of tests. Subsequent papers
will give the readers of ME the opportunity to delve
into such topics as applicability of latent trait theory
to the use of tests, using standardized test results for
instructional decision-making, mandated assessments
and their implications for school administration,
reporting educational progress to the community
and selecting standardized tests for local use.
Hopefully readers of ME will let their interests be
known to this editor.

Bill Mehrens and Bob Ebel are no strangers to the
readers of ME or to the profession of measurement.
Mehrens has co-authored several prominent books
including Measurement and Evaluation in Education
and Psychology, and is a board member of NCME.
Ebel, a former vice-president of ETS, is the editor of
the fourth edition of the Encyclopedia of Educational
Research, has authored a new edition of his
Essentials of Educational Measurement, and is a
past-president of AERA.

The Tests: How Do
We Define Them?



A Current Controversy

Measurement specialists differ in their enthusiasm for
criterion-referenced tests. Some see criterion-referenced
tests (CRTs) as the modern, improved form that all
educational achievement tests should take henceforth.
They feel that the problems that have plagued testing with
norm-referenced tests (NRTs) will largely disappear if
criterion-referenced tests are substituted. Pupils will learn
more, and learn it better, if their efforts and those of their
teachers are directed and evaluated by criterion-referenced
tests.

Others see a more limited, special role for CRTs. If
things to be learned are relatively few in number, separate
and distinct, and if mastery of each specific ability is
possible and desirable, then criterion-referenced testing
is the form to use. This is likely to be true of some basic
skills in the elementary grades, and in a few other areas of
specialized competence. It is not likely to be true, say CRT
skeptics, of most areas of study pursued in the upper
grades, high schools and college.



Some Earlier Controversies 

These differences of opinion over the relative merits of
alternative types of tests are nothing new. They have been
with us for a long, long time; probably for about as long as
tests have been used. In 1845 the Boston schoolmasters
resisted Horace Mann's proposal that written examinations
be substituted for the prevalent oral examinations.
When objective tests came into use early in the twentieth
century, there were vigorous debates over their merits
when compared with essay tests. Later on the focus of
controversy shifted to tests of application versus tests of
knowledge. In recent decades the advantages of formative
tests and testing over summative tests have been
argued. This paper discusses at some length the issue of
criterion-referenced vs. norm-referenced tests.

In each of these controversies it has been apparent to all
but the blindest of partisans that each type of test has its
values and its limitations. Thus the real question is not
"Which shall we use?" but "When shall we use it?" Stating
the issue more appropriately does not guarantee an easy
resolution of it, but it does improve its chances.



The Problem of Definition

One of the difficulties in adequately addressing the
question of which type of test to use is the lack of
consistent definitions of some terms. Advocates of
criterion-referenced tests have not always defined the
concept the same way. Some of them, incorrectly we
believe, make no distinction between the terms "norm
referenced" and "standardized." One of the most
influential advocates of criterion-referenced measurement
wrote:

The key distinction, of course, between norm-referenced
and criterion-referenced measurement is
that in the former case we reference, that is, relate, an
individual's performance to that of a norm group; in
the latter case we reference an individual's performance
to a criterion (Popham, 1975, p. 130).

This is clear and suggests that the distinction is between
norm-referenced test score interpretation and
criterion-referenced test score interpretation. But the same author
defined a criterion-referenced test as follows: "A
criterion-referenced test is used to ascertain an individual's
status with respect to a well-defined behavioral
domain" (Popham, 1975, p. 130).

Perhaps this kind of test is described more accurately as
a "domain-referenced achievement test," as the author
himself has acknowledged (Popham, 1978, p. 94).
Although there are inconsistencies with respect to how
the terms norm-referenced testing (or measurement) and
criterion-referenced testing (or measurement) are used,
the distinction between the two types of scores seems
clear enough. If we interpret a score of an individual by
comparing his score to those of other individuals (called a
norm group) this would be norm referencing. If we
interpret a person's performance by comparing it to some
specified behavioral criterion of proficiency, this would
be criterion referencing. To polarize the distinction, we
could say that the focus of a normative score is on how
many of Johnny's peers do not perform (score) as well as
he does; the focus of a criterion-referenced score is on
what it is that Johnny can do. Of course we can, and often
do, interpret a single test score both ways. In norm
referencing we might make a statement that "John did
better than 80 percent of the students in a test on addition
of whole numbers." In criterion referencing we might say
that "John got 70 percent of the items correct in a test on
addition of whole numbers." Usually, we would add
further "meaning" to this statement by stating whether or
not we thought 70 percent was inadequate, minimally
adequate, excellent, or whatever.
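The two readings of a single score can be sketched in a few lines of Python. This is only an illustration: the function names, the 20-item test length, and the ten norm-group scores are hypothetical and were chosen so that the results match the 70 percent and 80 percent figures in the John example.

```python
from bisect import bisect_left


def percent_correct(items_right: int, items_total: int) -> float:
    """Criterion-referenced reading: share of the test content answered correctly."""
    return 100.0 * items_right / items_total


def percent_below(score: float, norm_group_scores: list[float]) -> float:
    """Norm-referenced reading: share of the norm group scoring strictly lower."""
    ordered = sorted(norm_group_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical data: John answers 14 of 20 addition items correctly,
# and the norm group consists of ten raw scores on the same test.
john_right, n_items = 14, 20
norm_group = [6, 8, 9, 10, 11, 12, 12, 13, 15, 17]

john_pct_correct = percent_correct(john_right, n_items)  # content-based statement
john_pct_below = percent_below(john_right, norm_group)   # comparison-based statement
```

The same raw score of 14 feeds both functions; only the reference changes, which is the sense in which the distinction is one of interpretation rather than of test type.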



The Importance of Test Content 

Although measurement experts generally agree to the
distinction between norm-referenced and criterion-referenced
interpretation, some misunderstanding exists
about the norm-referenced interpretation. There have
been those who have said, or strongly implied, that
norm-referenced measurement tells us nothing about what a
person can do, but only how the person compares with
others, as if the comparison did not involve any specified
content (e.g., Popham, 1976; Samuels & Edwall, 1975). It is
true that the content is specified only in very general terms
for some standardized tests. For many others, detailed
content outlines are provided by the test publishers.

To be meaningful, any test scores must be related to test
content as well as to the scores of other examinees (Ebel,
1962, p. 19). Any test will sample the content of some
specified domain and there is always an implicit behavioral
element. However, in norm-referenced measurement,
in contrast to criterion-referenced measurement, "the
inference is of the form -- 'more (or less) of trait x than the
mean amount in population y' -- rather than some
specified amount that is meaningful in isolation" (Jackson,
1970, p. 2).

Experts in achievement test construction have always 
stressed the importance of defining the specified content 
domain and sampling from it in some appropriate fashion. 
Thus, all good achievement test items, be they norm or
criterion-referenced, should be keyed to a set of
objectives and represent a specified content domain. If
they are, the test is likely to have content validity. 



Content Validity 


A classic article by Lennon (1956), written before the
current interest in CRTs, discusses three assumptions 
underlying the use of content validity: 

1. The area of concern to the tester can be conceived as 
a meaningful, definable, universe of responses. 

2. A sample can be drawn from this universe in some 
purposeful meaningful fashion. 

3. The sample and sampling process can be defined
with sufficient precision to enable the test user to
judge how adequately performance on the sample
typifies performance on the universe.

The assumptions apply to both CRTs and NRTs. One
examines the content validity of any achievement test
intended for a particular use by looking, among other
things, at the degree to which these three assumptions are
warranted. Whether or not during interpretation the
referencing is normative or criterion based is irrelevant.



Now it is quite possible that some locally constructed or
tailor-made CRTs will have better content validity for
some purposes than do standardized achievement tests
which are sometimes misleadingly called norm-referenced
tests. But it is wrong to imply that tests which
are either standardized or norm referenced are seriously
or necessarily lacking in content validity.



Content Referencing of Standardized Tests

It is also wrong to imply that only norm referencing is
available for any standardized achievement test. 
Objective-referenced analysis can be made for scores on 
such tests as the Iowa Test of Basic Skills, The Stanford 
Achievement Test, the Metropolitan Achievement Test or 
the Comprehensive Tests of Basic Skills. The Stanford 
Achievement Test publisher can provide local schools 
with print-outs that indicate for each item (and for items 
grouped by instructional objectives) the behavioral 
objective for that item; the percent correct of that item for
each class, for the total school building, for the district,
and for the national standardization group. The print-outs 
indicate whether the local (class, building, or system) 
percent right is significantly lower or higher than the 
national percent right. They also show the response each 
pupil chose for each item. Obviously, the Stanford 
provides both normative and criterion-referenced data.



Item Content vs. Item Statistics

Advocates of criterion-referenced tests sometimes
allege that standardized test publishers are concerned
almost exclusively with the statistical characteristics of
their test items, and that they virtually ignore content
relevance or representativeness. Consider this statement:
"... above all [publishers of norm-referenced tests] strive
to produce tests that can really spread out the norm
[illegible]" (italics added) (Popham, 1978, pp.
82-82). Publishers of achievement tests would deny that
this is true, and a quick survey of the standard textbooks in
test construction clearly shows that such striving is NOT
the way to build content-valid achievement tests. In
discussing item analysis, Nunnally (1978, p. 264) states, "It
should be emphasized that item analysis of achievement
tests is secondary to content validity." Ebel (1972, p. 394)
and Mehrens and Lehmann (1978, p. 329) make the same
point. Most test publishers build tests recognizing the
widely accepted priority of content validity over "good"
item statistics. Certainly their test manuals typically give
more coverage to content validity concerns than to item
analysis approaches in item selection.



Further Definitions

It is our belief that all tests (whether CRT or NRT,
whether standardized or not) should be keyed to
objectives or should represent a specified content
domain. Whether this process is sufficient to legitimately
allow a "criterion-referenced interpretation" depends on
how restrictive a definition one holds for a
criterion-referenced test. Ivens (1970, p. 2) simply defined a
criterion-referenced test as one "comprised of items
keyed to a set of behavior objectives." Harris and Stewart
(1971, p. 1) gave a much more restrictive definition: "A
pure criterion-referenced test is one consisting of a
sample of production tasks drawn from a well-defined
population of performances, a sample that may be used to
estimate the proportion of performances in that population
which the student can succeed." Glaser and Nitko
(1971) define criterion-referenced tests as those deliberately
constructed so as to yield scores directly interpretable
in terms of specified performance standards. Millman
(1974) would use the term "domain-referenced test" for
the Harris-Stewart definition of CRTs, and the term
"objective-based test" for Ivens' definition.

All good achievement tests (i.e., those with high content
validity) are objective based. Very few can truly be called
domain-referenced (or fit the Harris-Stewart definition of
a pure criterion-referenced test). Most existing achievement
tests probably fit in the general category of what
Millman (1974) calls a criterion-referenced differential
assessment device (CRDAD). In constructing such tests,
one defines a content domain (but generally not with
complete specificity) and writes items measuring this
domain. But if one uses statistical procedures to judge the
quality of the items with respect to their ability to
differentiate groups or individuals on the degree to which
they have achieved the attribute, then one lowers the
confidence to be placed in an inference that a student
"knows" 75% of the items correctly. The uncertainty of
this particular inference is due to the use of empirical data
in choosing items, whether those empirical data are
pre-post test differences in item difficulty, or are the
capabilities of the items to discriminate between good and
poor students at a single point in time.

Actually there are probably few situations where we
need to make the pure domain-referenced interpretation.
To know that an individual can type 60 words per
minute is useful data whether or not the words on the tests
were randomly chosen from some totally specified
domain of words. To know that an individual can correctly
add 80% of the items on paired three-digit whole numbers
asked on a test is useful whether or not those items were
randomly pulled from the total set of permutations
possible.
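For contrast, the pure domain-referenced case, in which items really are drawn at random from a fully specified domain, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 25-item test length, the score of 20 correct, and the use of a simple binomial standard error (random sampling with replacement, no finite-population correction) are all choices made for the example and do not come from the sources cited above.

```python
import math
import random


def sample_addition_tasks(n_items: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw addend pairs at random (with replacement) from the fully
    specified domain of three-digit whole-number additions."""
    rng = random.Random(seed)
    return [(rng.randint(100, 999), rng.randint(100, 999)) for _ in range(n_items)]


def estimate_domain_proportion(n_correct: int, n_items: int) -> tuple[float, float]:
    """Estimate the proportion of the whole domain the student can do,
    with a binomial standard error for that estimate."""
    p_hat = n_correct / n_items
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n_items)
    return p_hat, std_err


# Hypothetical result: a student answers 20 of 25 randomly drawn items.
tasks = sample_addition_tasks(25)
p_hat, std_err = estimate_domain_proportion(n_correct=20, n_items=len(tasks))
```

Only when the items are sampled this way does the observed proportion carry a sampling-based margin of error as an estimate for the whole domain; with hand-picked or statistically screened items, the observed percentage describes the test actually given.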

The following distinctions among test types may bring
to a focus some of the comments that have just been
made.

1. Standardized achievement tests: These are commercially
developed and may use both normative and
criterion referencing. They typically sample from a
broad domain of general interest and, therefore,
may have less content validity for a specific purpose
than a tailor-made achievement test developed just
for that purpose. However, they would have more
content validity for those who are interested in the
broader domain.

2. Tailor-made achievement tests: These also may use
both normative and criterion referencing. They may
well be "standardized" with respect to administrative
procedures. The primary distinction is that tailor-made
tests are built for a specific purpose and usually
sample from a constricted domain.

3. Norm-referenced interpretation: To add meaning to
a person's score by comparing it to those of other
individuals in a specified group (or groups).

4. Criterion-referenced test interpretation: To add
meaning to a person's score by comparing it to some
specified criterion of proficiency.

5. Objective-referenced tests: Those that are composed
of tasks keyed to a set of objectives.

6. Domain-referenced tests: Those that consist of tasks
that are sampled from a well-defined population of
tasks in such a fashion that one can estimate the
proportion of tasks in the population at which the
student can succeed.

The appropriate distinctions are between 1 and 2 on the
one hand and 3 and 4 on the other. To attempt to contrast
1 and 4 (or 2 and 3) only confuses the issue. The closer any
achievement test comes to fitting definition 6, the higher
the content validity is likely to be. Few, if any, tests will
ever have perfect content validity.

The Tests:
How Do We Use Them?



Standardized vs. Tailor-Made Achievement Tests

What about the relative merits of standardized versus
tailor-made achievement tests? Popham has suggested
that "A growing number of educators, frustrated because
the more traditional achievement tests continue to make
their programs appear ineffectual, are flocking for
salvation to these newer measures" (1976, p. 593). If one
accepts this statement as essentially true, it raises an
interesting question. Do the traditional achievement tests
correctly or incorrectly show programs to be ineffectual?
Are the educators turning away from traditional achievement
tests because their pupils are achieving goals not
tested traditionally? Is this what the supporters of
criterion-referenced tests are likely to contend? Or is it
because a tailor-made test avoids the possibility of
comparisons between schools?

It is true, as several authors have suggested over the past 
few years, that standardized achievement tests have some 
limitations. But how much do these limitations detract 
from the qualities of standardized tests? How seriously do
they limit the usefulness of such tests? 

While there are many factors to look at in judging a test,
the two most important are reliability and validity.
Standardized achievement tests tend to be quite reliable.
The type of validity of concern is content validity.
Standardized achievement tests have substantial content
validity for typical school curricula. Contrary to the
impressions of some critics, standardized achievement
test items do measure objectives. They are based on, and
sample, a specified content domain. That domain may not
be specified with complete precision, and the sample may
not be completely representative. But these are limitations
of every existing test, whether that test be called a
standardized test, a tailor-made test, a CRT, an objectives-based
test, or a domain-referenced test. No general
statement about different degrees of content validity for
standardized and nonstandardized tests is likely to be
accurate. How much content validity a test has for a
particular purpose depends on how well the items
measure the objectives and sample the domain one is
interested in at that time.

The charge that the items in a standardized achievement
test do not match the specific objectives of a
particular instructional procedure as well as an ideal test
built to intentionally measure those, and only those,
specific objectives is tautologically true. But does that
mean we should always prefer the second test? Such an
extreme assertion is most unlikely to be true. Consider
Cronbach's words:

In course evaluation, we need not be much concerned
about making measuring instruments fit the
curriculum. However startling this declaration may
seem, and however contrary to the principles of
evaluation for other purposes, this must be our
position if we want to know what changes a course
produces in the pupil. An ideal evaluation would
include measures of all the types of proficiency that
might reasonably be desired in the area in question,
not just the selected outcomes to which this
curriculum directs substantial attention. If you wish to
know how well the curriculum is serving the national
interest, you measure all outcomes that might be
worth striving for (1963, p. 680).

Standardized achievement tests, sampling a broader
domain, are more likely to help answer the question of the
adequacy of the curriculum than are tests tailor-made to
the specific instructional objectives.

The question of whether to use a more narrowly
focused tailor-made test or a broader based standardized
test is simply a consideration of the bandwidth/fidelity
tradeoff. It seems foolish to assume that narrow bandwidth
and high fidelity is always the preferred approach.



Uses for Criterion-Referenced Interpretation

The recent support for criterion-referenced interpreta- 
tion seems to have originated in large part from the 
emphases on behavioral objectives, the individualization 
of instruction, the development of programmed mate- 
rials, a learning theory that suggests that most anybody can 
learn most anything if given enough time, the increased 
interest in using tests for certification, and a belief that 
norm referencing promotes unhealthy competition and is 
injurious to low-scoring students' self-concepts. If we can 
specify important objectives in behavioral terms, then,
many would argue, the important consideration is
whether a student has reached those objectives, not to
determine his position relative to other students.

Traditionally, the principal use of criterion-referenced
measurement has been in "mastery tests." A mastery test is
a particular type of criterion-referenced test. Mastery tests
are used in programs of individualized instruction, such as
the Individually Prescribed Instruction (IPI) program
(Lindvall and Bolvin, 1967), or in the mastery learning
model proposed by Bloom (1968).



Criterion-referenced interpretations are also useful in 
making decisions about instructional programs. In order 
to determine whether specific instructional treatments or 
procedures have been successful, it is necessary to have
data about the attainment of the specific objectives the
program was designed to teach. A measure that compares
students to each other (norm-referenced) may not do this
as effectively as a measure comparing each student's 
performance to the objectives. 

Also, criterion-referenced measures offer certain
benefits for instructional decision making within the
classroom. The diagnosis of specific difficulties, followed
by a prescription of certain instructional treatments, is
necessary in instruction whether or not one uses a mastery
approach to learning. Of course one must be very
cautious about diagnosing specific difficulties on a one to
five item subtest.

Finally, criterion-referenced test interpretations can be
useful in broad surveys of educational accomplishment
such as the National Assessment of Educational Progress
or state assessment programs.



Mastery Testing 

The idea of mastery learning and mastery testing is not
new (see Washburne, 1922; Morrison, 1926). But the idea
has not been supported unanimously. As Baker (1971, p.
65) suggested, "A considerable literature relating to the
evils of mastery tests exists, and much of the work of early
educational psychologists was in reaction to the unreal
requirement that all pupils achieve criterion performance."
The basic idea of mastery learning, however, was
revitalized with the publication of a paper by Carroll
(1963) entitled "A Model of School Learning." Essentially,
the model suggests that the degree of learning is a
function of the time the student spends on the material,
divided by the time needed. More precisely, Carroll
suggested that the degree of learning is some function of
the time allowed and the perseverance of the student,
divided by the student's aptitude for the task, his ability to
understand the instruction, and the quality of instruction.
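The relationship just described can be sketched numerically. This is an illustration only, not Carroll's own formulation: the capped ratio standing in for the unspecified "function," the folding of aptitude, ability to understand instruction, and quality of instruction into a single `time_needed` figure, and the hour values are all assumptions made for the example.

```python
def degree_of_learning(time_allowed: float, perseverance: float, time_needed: float) -> float:
    """Assumed form of the model: time actually spent divided by time needed,
    capped at 1.0 (complete learning of the task).

    time_allowed -- hours of instruction the school provides
    perseverance -- hours the student is willing to keep working
    time_needed  -- hours this student needs (an assumed single figure that
                    stands in for aptitude, ability to understand instruction,
                    and quality of instruction)
    """
    if time_needed <= 0:
        raise ValueError("time_needed must be positive")
    time_spent = min(time_allowed, perseverance)
    return min(1.0, time_spent / time_needed)


# Hypothetical students: one needs 2 hours for the unit, the other 6 hours.
fast_fixed_time = degree_of_learning(time_allowed=3, perseverance=10, time_needed=2)
slow_fixed_time = degree_of_learning(time_allowed=3, perseverance=10, time_needed=6)

# Letting the time allowed vary, as a mastery approach does, changes the outcome.
slow_extended_time = degree_of_learning(time_allowed=6, perseverance=10, time_needed=6)
```

With time held fixed the two students differ in degree of learning; with the degree of learning held fixed they differ in time, which is the trade the mastery model makes.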

Bloom (1968) agreed with the basics of the model and
suggested that the degree of learning required should be
fixed at some "mastery" level and that the instructional
variable should be manipulated so that all (or almost all)
students achieve mastery. Bloom stated that "Most
students (perhaps over 90 percent) can master what we
have to teach them" (Bloom, 1968, p. 1). If the model is
correct and if people should all persevere until they have
"mastered" the material, then the mastery learning model
of instruction should be employed and mastery testing
needs to be used to determine whether mastery has
occurred.

Tentative evidence (Block, 1971; 1974) suggests that in
many subject-matter areas all students can achieve some
level of mastery, although, as Carroll (1971, p. 31)
pointed out, if the task is very difficult or depends upon
special aptitudes, there may be a number of students who
never make it. Becoming a four-minute miler or a concert
pianist are examples of such tasks.

Excluding the extreme 5 percent of the students, the
ratio between slower and faster students in the time
required to master a set of objectives is about 6 to 1,
although Bloom et al. (1971, p. 51) and Bloom (1974, p. 685)
have suggested that this may be reduced to about 3 to 1.
This 3 to 1 ratio is elapsed time. Bloom suggests that a more
precise measure is the amount of time a student is actively
working on a project. He feels the differences on this
variable may be reducible to a ratio of 1.5 to 1 (Bloom,
1974, p. 688). Glaser (1968, p. 28) reported that in three
years of individually prescribed instruction in mathematics,
one student had covered 73 units, one only 13.
Whether or not it is worthwhile educationally to have a
student persist for three to six days, weeks, or years on a
task that others can complete in one day, week, or year is
debatable. Perhaps they should in learning those basic
skills that are needed frequently by almost everyone, or
that must be achieved to facilitate further learning. For
other things we attempt to teach in school, such as
understanding of modern literature, a mastery model
should probably not be employed. There is even some
doubt if it would work for such a subject. As both Block
(1971, p. 66) and Bloom (1971b, p. 33) pointed out, mastery
learning strategies are more effective for closed subjects
(those whose content has not changed for some time) and
those that emphasize convergent rather than divergent
thinking. The implications for education of this admission
by mastery advocates are not always fully appreciated.
Cronbach brought the issue into sharp focus:

I find the concept of mastery severely limiting and in
trying to find out where my distress lies, I finally
focused on one word in the Bloom paper: he states
that mastery learning is closed. Training is closed. In
education the problems are open ... I see educational
development as continuous and open-ended.
"Mastery" seems to imply that at some point we get to the
end of what is to be taught (Cronbach, 1971, pp. 52,
53).

Anastasi has made the same point: "... beyond basic skills,
mastery testing is inapplicable or insufficient" (1976, p. 99).



Uses for Norm-Referenced Interpretations

Most testing and test theory has been based on the
norm-referenced applications. There is little argument
that such an approach is useful in aptitude testing where
we wish to make differential predictions. It is also often
very useful in achievement testing. Many educators
would agree with Gronlund's ([illegible]) statement: "In
measuring the extent to which pupils are achieving our
course objectives, we have no absolute standard by which
to determine their progress. A pupil's achievement can be
regarded as high or low only by comparing it with the
achievement of other pupils."

Accepting this view, the role of a measuring device is to
give us as reliable a rank ordering of the pupils, with
respect to the achievement we are measuring, as possible
(or at least reliably place individuals into multiple
categories). Knowing what we do about individual
differences, it is obvious that students will learn differing
amounts of subject matter even under a mastery-learning
approach. It may be that all students, or at least a high
percentage of them, have learned a significant enough
portion of a teacher's objectives to be categorized as
having "mastered" the essentials of the course or unit. But
some of these students have learned more than others,
and it seems worthwhile to employ measurement
techniques that identify these pupils. In the first place,
students want and deserve recognition for accomplishment
that goes beyond the minimum. If we would
continually give only mastery tests, those students who
accomplish at a higher level would lose one of the
important extrinsic rewards of learning, that is, recognition
for such accomplishments. (Of course, a CRT might not be
a mastery test and might provide multiple categories. The
more categories the more it discriminates like an NRT.)

Perhaps a more important reason than student recognition
for discrimination testing is in its benefits for decision
making. If two physicians have mastered surgery, but one
has mastered it better, which one do you wish to have
operate on you? For that matter, even if two physicians
had equally mastered their training program, one would
probably want some norm-referencing information about
time to completion. If one physician is such a slow learner
that it takes him five times as long to learn the material as
the other one, it is probably safe to assume that after he
has been on the job ten years, he will not be so up-to-date
on current medical practices as the fast learner. If two
teachers have mastered the basics of teaching, but one is a
much better teacher, which do we want to hire? If two
students have mastered first-semester algebra, but one
has learned it much better (or faster, time being
norm-referenced), which should receive the most encouragement
to continue in mathematics? We probably all agree
on the answers to these questions. However, if we have
not employed measurement techniques that allow us to
differentiate between the individuals, we cannot make
these types of decisions. Certainly, norm-referenced
measures are the most helpful in fixed-quota selection
decisions. For example, if there are a limited number of
openings in a pilot-training school, the school would want
to select the best of the applicants, even though all may
be above some "mastery level."
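The difference between a mastery decision and a fixed-quota decision can be sketched as follows. The applicant labels, scores, cut score of 80, and quota of 2 are hypothetical values chosen only to illustrate the point.

```python
def passes_mastery(scores: dict[str, float], cut: float) -> list[str]:
    """Criterion-referenced decision: everyone at or above the cut passes."""
    return sorted(name for name, score in scores.items() if score >= cut)


def select_fixed_quota(scores: dict[str, float], quota: int) -> list[str]:
    """Norm-referenced decision: rank applicants and take the top `quota`.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:quota]


# Hypothetical applicants to a school with only two openings.
applicants = {"A": 91, "B": 84, "C": 97, "D": 88}

masters = passes_mastery(applicants, cut=80)         # all four clear the cut
selected = select_fixed_quota(applicants, quota=2)   # only the two highest are admitted
```

The mastery list cannot fill a quota smaller than the number who pass; only the rank ordering can, which is why a test that reports nothing beyond pass or fail does not support this kind of decision.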

Excellence in any human endeavor is inescapably
relative. This is as true of the learning students pursue as it
is of the instruction a school provides. We cannot prevent
or avoid comparisons among persons unless we are willing
to give up the pursuit of excellence, unless we choose to
ignore differences among people, or to defy reason by
asserting that such differences are of no importance.
Those who disparage norm-referenced score interpretations
because they involve comparisons among persons or
groups are neither soundly realistic nor beneficially
idealistic.



In Conclusion

There is a place in educational measurement for both
norm-referenced and criterion-referenced test interpretations.
The question is not which interpretation to use,
but when to use each. It is regrettable that we have mixed
up types of tests. It is regrettable that some have advocated
local tailor-made tests, not as desirable supplements to
external standardized tests, which they are, but as
generally superior alternatives, which they are not. Time
has a way of correcting such errors. May it do so soon.